[SPARK-56295][PYTHON] Add Java error classes to Python side#55100
[SPARK-56295][PYTHON] Add Java error classes to Python side#55100gaogaotiantian wants to merge 3 commits intoapache:masterfrom
Conversation
| ERROR_CLASSES_MAP = json.loads(ERROR_CLASSES_JSON) | ||
|
|
||
|
|
||
| def get_error_classes() -> dict[str, dict]: |
There was a problem hiding this comment.
Will this impact python Spark Connect client-only installations? They would not have the jars installed.
There was a problem hiding this comment.
Good question. So for now, it won't generate any error. The connect client would miss the java side error classes. As long as connect client only uses python error classes it should be fine.
There was a problem hiding this comment.
I wonder if we can add a simple validation for that. Otherwise, the error might fail with a weird error, e.g., if error code in JVM is mistakenly used in Spark Connect code.
There was a problem hiding this comment.
Hm. Actually I wonder if we just add a linter to sync JSON and Python error classes ....
There was a problem hiding this comment.
We can add any linter that we want, but we need to first figure out how we should organize our error classes.
We have two questions:
- how many "real" JSON files do we want (source of truth)
- how many copies do we want (to make things like packaging easier)
We can have linter to enforce our ideas so we don't have to worry about how to make it work at this point, just what we want.
For this current PR, we have 2 source of truth (JAVA + Python) and no copy. Python has Python + JAVA and JAVA has JAVA only. The good thing about this is:
- Python always has all the error classes
- No sync needed
- Python can have its dedicated error classes (client error for example)
We can have a single error class file put in JAVA source code. We probably have to copy it to python because we need to have that accessible to connect clients (we can actually read it from classic though).
The question is - we probably do have error classes that are specific to Python code, putting those in the JAVA json file is a bit weird to me. That's why I chose this structure.
This indeed introduced the problem of connect client not able to get the JAVA error class - which is actually our current behavior (before this PR). Python does not understand JAVA error class at all before this PR and we are fine. If we want the client to have all the error classes, we can copy the JAVA side json to python and enforce the sync checking in CI.
There was a problem hiding this comment.
What I prefer is just to make it easier to maintain. e.g., I got some questions about this error class files time to time still, and why are they different.
Another approach might be just add a check if the error class exists in Exception class itself that is only enabled when testing (e.g., SPARK_TESTING environment variable with an assert).
There was a problem hiding this comment.
why are they different
I think I mentioned this in the long comment I wrote above :) Basically here will be some error classes that are Python specific (like connect client errors), and it would be weird to put it to Java source code. Java also doesn't need to know this.
We can have all kind of validations for this, for now we only need to decide how to organize these error classes.
What changes were proposed in this pull request?
Include Java side error classes in Python so python can generate consistent error messages as Java with the same error class + message parameters.
Why are the changes needed?
Currently we have two separate
error-conditions.jsonfiles, for Python and Java. I think it's historical reason that we can't easily load Java's json on Python so we created a Python specific one. The issue is that two files are unsynced. It's possible that we want to raise the same error on both Python and Java, and we either need to create a new error class on Python json file or forget to do that and break message generation.Ideally we should have a single source of truth, which is achievable now by merging every existing Python error classes into Java one. But there could be unexpected consequences and it's difficult to split them back once it's merged. Moreover, the Python file contains some python-specific error classes which may not fit into the java file.
As a compromise, we keep the Python file and load the Java file from Python. We keep the current structure and make it more flexible. In the future, if we have a clear plan of which way we should go, we can always eliminate the Python json file.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
A new test is added to test Python can raise Java-specific error class.
Was this patch authored or co-authored using generative AI tooling?
No.